AITopics | plagiarism detection

plagiarism detection

Bin2Vec: Interpretable and Auditable Multi-View Binary Analysis for Code Plagiarism Detection

Moussaoui, Moussa, Houichime, Tarik, Sadiq, Abdelalim

arXiv.org Artificial Intelligence | Dec-3-2025

We introduce Bin2Vec, a new framework that helps compare software programs in a clear and explainable way. Instead of focusing only on one type of information, Bin2Vec combines what a program looks like (its built-in functions, imports, and exports) with how it behaves when it runs (its instructions and memory usage). This gives a more complete picture when deciding whether two programs are similar or not. Bin2Vec represents these different types of information as views that can be inspected separately using easy-to-read charts, and then brings them together into an overall similarity score. Bin2Vec acts as a bridge between binary representations and machine learning techniques by generating feature representations that can be efficiently processed by machine-learning models. We tested Bin2Vec on multiple versions of two well-known Windows programs, PuTTY and 7-Zip. The primary results strongly confirmed that our method computes an optimal and visualization-friendly representation of the analyzed software. For example, PuTTY versions showed more complex behavior and memory activity, while 7-Zip versions focused more on performance-related patterns. Overall, Bin2Vec provides decisions that are both reliable and explainable to humans. Because it is modular and easy to extend, it can be applied to tasks like auditing, verifying software origins, or quickly screening large numbers of programs in cybersecurity and reverse-engineering work.
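
As a rough illustration of the multi-view idea in this abstract, the sketch below scores two programs separately per view and then combines the per-view scores into one overall number. The view names, the Jaccard and cosine measures, the weights, and the feature dictionaries are all assumptions made for illustration; the abstract does not say how Bin2Vec computes or combines its views.

```python
import math

def jaccard(a, b):
    """Set overlap, used here for a static view such as imported function names."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cosine(u, v):
    """Cosine similarity between two count dictionaries (e.g. opcode histograms)."""
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def multi_view_similarity(prog_a, prog_b, weights):
    """Return (per-view scores, weighted overall score)."""
    views = {
        "imports": jaccard(prog_a["imports"], prog_b["imports"]),
        "opcodes": cosine(prog_a["opcodes"], prog_b["opcodes"]),
    }
    total = sum(weights[name] for name in views)
    overall = sum(weights[name] * score for name, score in views.items()) / total
    return views, overall

# Hypothetical feature dictionaries for two builds of the same tool.
build_1 = {"imports": {"CreateFileW", "ReadFile", "WSAStartup"},
           "opcodes": {"mov": 120, "call": 40, "jmp": 15}}
build_2 = {"imports": {"CreateFileW", "ReadFile", "connect"},
           "opcodes": {"mov": 110, "call": 45, "jmp": 20}}

views, overall = multi_view_similarity(
    build_1, build_2, {"imports": 0.5, "opcodes": 0.5}
)
print(views, round(overall, 3))
```

Keeping the per-view scores alongside the combined score is what makes the result inspectable: a reviewer can see which view drove the overall number.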

detection, machine learning, natural language, (17 more...)

arXiv:2512.02197

Country:

North America > United States (0.46)
Africa > Middle East > Morocco (0.28)
North America > Canada (0.28)

Genre: Research Report (0.40)

Industry:

Information Technology > Security & Privacy (1.00)
Government > Regional Government > North America Government > United States Government (0.46)
Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.45)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.89)

KurdSTS: The Kurdish Semantic Textual Similarity

Abdullah, Abdulhady Abas, Veisi, Hadi, Al, Hussein M.

arXiv.org Artificial Intelligence | Dec-1-2025

Semantic Textual Similarity (STS) measures the degree of equivalence between two texts and is important in many Natural Language Processing tasks. While extensive resources have been developed for high-resource languages, low-resource languages such as Kurdish have been neglected. This paper introduces the first STS dataset for Kurdish, which aims to close this gap. The dataset contains 10,000 formal and informal sentence pairs annotated for similarity. Several models were benchmarked on it, including Sentence Bidirectional Encoder Representations from Transformers (Sentence-BERT) and multilingual Bidirectional Encoder Representations from Transformers (multilingual BERT), among others; they achieved promising results while also showcasing the difficulties presented by the distinctive nature of Kurdish. This work paves the way for future studies in Kurdish semantic research and in Natural Language Processing for other low-resource languages.

large language model, machine learning, natural language, (19 more...)

arXiv:2510.02336

Country: Asia > Middle East > Iraq > Kurdistan Region (0.14)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.87)

MelodySim: Measuring Melody-aware Music Similarity for Plagiarism Detection

Lu, Tongyu, Geist, Charlotta-Marlena, Melechovsky, Jan, Roy, Abhinaba, Herremans, Dorien

arXiv.org Artificial Intelligence | Nov-20-2025

We propose MelodySim, a melody-aware music similarity model and dataset for plagiarism detection. First, we introduce a novel method to construct a dataset focused on melodic similarity. By augmenting Slakh2100, an existing MIDI dataset, we generate variations of each piece while preserving the melody through modifications such as note splitting, arpeggiation, minor track dropout, and re-instrumentation. A user study confirms that positive pairs indeed contain similar melodies, while other musical tracks are significantly changed. Second, we develop a segment-wise melodic-similarity detection model that uses a MERT encoder and applies a triplet neural network to capture melodic similarity. The resulting decision matrix highlights where plagiarism might occur. The experiments show that our model is able to outperform baseline models in detecting similar melodic fragments on the MelodySim test set.
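
The triplet objective mentioned in this abstract can be sketched with the standard triplet margin loss, which pulls an anchor segment toward a segment with the same melody and pushes it away from an unrelated one. The Euclidean distance, the margin of 1.0, and the toy three-dimensional embeddings below are assumptions for illustration; in the described system the embeddings would come from the MERT encoder, and the abstract does not give the exact loss used.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Toy embeddings; a real system would use encoder outputs instead.
anchor = [0.0, 0.0, 0.0]
positive = [0.0, 0.3, 0.4]   # same melody, different arrangement (distance 0.5)
negative = [3.0, 4.0, 0.0]   # unrelated piece (distance 5.0)

well_separated = triplet_loss(anchor, positive, negative)
roles_swapped = triplet_loss(anchor, negative, positive)
print(well_separated, roles_swapped)
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so training effort concentrates on triplets that are still confusable.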

artificial intelligence, machine learning, natural language, (15 more...)

arXiv:2505.20979

Country: Europe (0.46)

Genre: Research Report (1.00)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)
Law (1.00)
Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.63)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Real-world Music Plagiarism Detection With Music Segment Transcription System

Go, Seonghyeon

arXiv.org Artificial Intelligence | Sep-11-2025

As a result of continuous advances in Music Information Retrieval (MIR) technology, generating and distributing music has become more diverse and accessible. In this context, interest in music intellectual property protection is increasing to safeguard individual music copyrights. In this work, we propose a system for detecting music plagiarism by combining various MIR technologies. We developed a music segment transcription system that extracts musically meaningful segments from audio recordings to detect plagiarism across different musical formats. With this system, we compute similarity scores based on multiple musical features that can be evaluated through comprehensive musical analysis. Our approach demonstrated promising results in music plagiarism detection experiments, and the proposed method can be applied to real-world music scenarios. We also collected a Similar Music Pair (SMP) dataset for musical similarity research using real-world cases. The dataset is publicly available.

artificial intelligence, machine learning, natural language, (14 more...)

arXiv:2509.08282

Genre: Research Report (0.82)

Industry:

Media > Music (1.00)
Leisure & Entertainment (1.00)
Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.76)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Enhancing Plagiarism Detection in Marathi with a Weighted Ensemble of TF-IDF and BERT Embeddings for Low-Resource Language Processing

Mutsaddi, Atharva, Choudhary, Aditya

arXiv.org Artificial Intelligence | Jan-9-2025

Plagiarism involves using another person's work or concepts without proper attribution, presenting them as original creations. With the growing amount of data communicated in regional languages such as Marathi -- one of India's regional languages -- it is crucial to design robust plagiarism detection systems tailored for low-resource languages. Language models like Bidirectional Encoder Representations from Transformers (BERT) have demonstrated exceptional capability in text representation and feature extraction, making them essential tools for semantic analysis and plagiarism detection. However, the application of BERT for low-resource languages remains under-explored, particularly in the context of plagiarism detection. This paper presents a method to enhance the accuracy of plagiarism detection for Marathi texts using BERT sentence embeddings in conjunction with Term Frequency-Inverse Document Frequency (TF-IDF) feature representation. This approach effectively captures statistical, semantic, and syntactic aspects of text features through a weighted voting ensemble of machine learning models.
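
A weighted voting ensemble of the kind this abstract names can be sketched as a weighted average of per-model probabilities followed by a threshold. The model names, the weights, the probabilities, and the 0.5 threshold below are hypothetical; the abstract does not state which models are combined or how the weights are chosen.

```python
def weighted_vote(probabilities, weights, threshold=0.5):
    """Combine per-model plagiarism probabilities with a weighted average.

    probabilities: dict of model name -> probability that the pair is plagiarized
    weights: dict of model name -> non-negative weight
    Returns (combined score, decision).
    """
    total = sum(weights[name] for name in probabilities)
    if total == 0:
        raise ValueError("weights must not all be zero")
    score = sum(weights[name] * p for name, p in probabilities.items()) / total
    return score, score >= threshold

# Hypothetical outputs: one model on TF-IDF features, one on BERT embeddings.
probs = {"tfidf_model": 0.40, "bert_model": 0.80}
weights = {"tfidf_model": 1.0, "bert_model": 3.0}

score, is_plagiarized = weighted_vote(probs, weights)
print(round(score, 2), is_plagiarized)
```

With these numbers the lexical model alone would say "not plagiarized", but the higher-weighted semantic model pulls the combined score above the threshold, which is the behavior one would want for paraphrased text.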

bert, detection, representation, (16 more...)

arXiv:2501.0526

Country:

Asia > India (0.24)
Europe > Finland > Uusimaa > Helsinki (0.05)
Europe > Portugal > Lisbon > Lisbon (0.04)
(2 more...)

Genre: Research Report > New Finding (0.94)

Industry: Education > Educational Technology > Educational Software > Computer-Aided Assessment (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.95)

Leveraging Explainable AI for LLM Text Attribution: Differentiating Human-Written and Multiple LLMs-Generated Text

Najjar, Ayat, Ashqar, Huthaifa I., Darwish, Omar, Hammad, Eman

arXiv.org Artificial Intelligence | Jan-6-2025

The development of Generative AI Large Language Models (LLMs) has raised the alarm about distinguishing content produced by generative AI from content produced by humans. In one case, issues arise when students rely heavily on such tools in a manner that can affect the development of their writing or coding skills. Other issues of plagiarism also apply. This study aims to support efforts to detect and identify textual content generated using LLM tools. We hypothesize that LLM-generated text is detectable by machine learning (ML), and investigate ML models that can recognize and differentiate texts generated by multiple LLM tools. We leverage several ML and Deep Learning (DL) algorithms, such as Random Forest (RF) and Recurrent Neural Networks (RNN), and use Explainable Artificial Intelligence (XAI) to understand the important features in attribution. Our method is divided into 1) binary classification, to differentiate between human-written and AI-generated text, and 2) multi-class classification, to differentiate between human-written text and the text generated by five different LLM tools (ChatGPT, LLaMA, Google Bard, Claude, and Perplexity). Results show high accuracy in both the multi-class and binary classification. Our model outperformed GPTZero, with 98.5% accuracy versus 78.3%. Notably, GPTZero was unable to recognize about 4.2% of the observations, but our model was able to recognize the complete test dataset. XAI results showed that understanding feature importance across different classes enables detailed author/source profiles. This further aids attribution and supports plagiarism detection by highlighting unique stylistic and structural elements, supporting robust content originality verification.

classification, large language model, machine learning, (21 more...)

arXiv:2501.03212

Country: North America > United States > Texas > Brazos County > College Station (0.14)

Genre: Research Report > New Finding (1.00)

Industry:

Health & Medicine > Therapeutic Area (0.68)
Government > Voting & Elections (0.47)
Education > Educational Technology > Educational Software (0.35)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.54)

PlagBench: Exploring the Duality of Large Language Models in Plagiarism Generation and Detection

Lee, Jooyoung, Agrawal, Toshini, Uchendu, Adaku, Le, Thai, Chen, Jinghui, Lee, Dongwon

arXiv.org Artificial Intelligence | Jun-23-2024

Recent literature has highlighted potential risks to academic integrity associated with large language models (LLMs), as they can memorize parts of training instances and reproduce them in the generated texts without proper attribution. In addition, given their capabilities in generating high-quality texts, plagiarists can exploit LLMs to generate realistic paraphrases or summaries indistinguishable from original work. In response to possible malicious use of LLMs in plagiarism, we introduce PlagBench, a comprehensive dataset consisting of 46.5K synthetic plagiarism cases generated using three instruction-tuned LLMs across three writing domains. The quality of PlagBench is ensured through fine-grained automatic evaluation for each type of plagiarism, complemented by human annotation. We then leverage our proposed dataset to evaluate the plagiarism detection performance of five modern LLMs and three specialized plagiarism checkers. Our findings reveal that GPT-3.5 tends to generate paraphrases and summaries of higher quality compared to Llama2 and GPT-4. Despite LLMs' weak performance in summary plagiarism identification, they can surpass current commercial plagiarism detectors. Overall, our results highlight the potential of LLMs to serve as robust plagiarism detection tools.

llm, plagiarism, source text, (15 more...)

arXiv:2406.16288

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
North America > United States > Pennsylvania > Centre County > University Park (0.04)
(8 more...)

Genre: Research Report > New Finding (1.00)

Industry: Education > Educational Technology (0.57)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Survey on Plagiarism Detection in Large Language Models: The Impact of ChatGPT and Gemini on Academic Integrity

Pudasaini, Shushanta, Miralles-Pechuán, Luis, Lillis, David, Salvador, Marisa Llorens

arXiv.org Artificial Intelligence | Jun-4-2024

The rise of Large Language Models (LLMs) such as ChatGPT and Gemini has posed new challenges for the academic community. With the help of these models, students can easily complete their assignments and exams, while educators struggle to detect AI-generated content. This has led to a surge in academic misconduct, as students present work generated by LLMs as their own, without putting in the effort required for learning. As AI tools become more advanced and produce increasingly human-like text, detecting such content becomes more challenging. This development has significantly impacted the academic world, where many educators are finding it difficult to adapt their assessment methods to this challenge. This research first demonstrates how LLMs have increased academic dishonesty, and then reviews state-of-the-art solutions for academic plagiarism in detail. A survey of datasets, algorithms, tools, and evasion strategies for plagiarism detection has been conducted, focusing on how LLMs and AI-generated content (AIGC) detection have affected this area. The survey aims to identify the gaps in existing solutions. Lastly, potential long-term solutions are presented to address the issue of academic plagiarism using LLMs based on AI tools and educational approaches in an ever-changing world.

arxiv preprint arxiv, chatgpt, detection, (12 more...)

arXiv:2407.13105

Country:

North America > United States > Michigan (0.04)
North America > United States > New York > New York County > New York City (0.04)
North America > Canada > British Columbia (0.04)
(4 more...)

Genre:

Overview (1.00)
Research Report > Promising Solution (0.34)

Industry:

Education > Educational Setting > Higher Education (0.93)
Education > Curriculum > Subject-Specific Education (0.93)
Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.64)
Education > Social Development & Welfare > Conduct & Behavior (0.61)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.31)

BERT-Enhanced Retrieval Tool for Homework Plagiarism Detection System

Xian, Jiarong, Yuan, Jibao, Zheng, Peiwei, Chen, Dexian

arXiv.org Artificial Intelligence | Apr-1-2024

Text plagiarism detection is a common natural language processing task that aims to detect whether a given text contains plagiarism or copying from other texts. In existing research, detection of high-level plagiarism is still a challenge due to the lack of high-quality datasets. In this paper, we propose a plagiarized text data generation method based on GPT-3.5, which produces a plagiarism detection dataset of 32,927 text pairs covering a wide range of plagiarism methods, bridging the gap in this part of research. Meanwhile, we propose a plagiarism identification method based on Faiss with BERT that offers high efficiency and high accuracy. Our experiments show that this model outperforms other models on several metrics, reaching 98.86%, 98.90%, 98.86%, and 0.9888 for Accuracy, Precision, Recall, and F1 Score, respectively. Finally, we also provide a user-friendly demo platform that allows users to upload a text library and intuitively participate in the plagiarism analysis.
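
The retrieval step this abstract describes (embed texts, then search a library for the closest matches) can be sketched without any external dependency. The code below is a brute-force cosine search that stands in for a Faiss index, and the three-dimensional vectors stand in for BERT sentence embeddings; the document names, vectors, and `top_k` helper are hypothetical and are not the authors' implementation.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def top_k(query, library, k=2):
    """Return the k library entries most similar to the query, best first."""
    q = normalize(query)
    scored = []
    for doc_id, vec in library.items():
        v = normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, v)), doc_id))
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]

# Toy "embeddings" for a small text library.
library = {
    "essay_a": [1.0, 0.0, 0.0],
    "essay_b": [0.9, 0.1, 0.0],
    "essay_c": [0.0, 0.0, 1.0],
}
query = [1.0, 0.05, 0.0]   # embedding of the submission being checked

hits = top_k(query, library, k=2)
print(hits)
```

A dedicated index replaces the linear scan when the library is large; the candidates it returns would then be passed to a classifier or a similarity threshold to decide whether plagiarism occurred.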

dataset, plagiarism, plagiarism strategy, (14 more...)

arXiv:2404.01582

Country: Asia > China (0.05)

Genre: Research Report (1.00)

Industry: Education > Educational Technology > Educational Software > Computer-Aided Assessment (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.90)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.82)
(2 more...)

Deep Learning Detection Method for Large Language Models-Generated Scientific Content

Alhijawi, Bushra, Jarrar, Rawan, AbuAlRub, Aseel, Bader, Arwa

arXiv.org Artificial Intelligence | Feb-27-2024

Large Language Models (LLMs), such as GPT-3 and BERT, reshape how textual content is written and communicated. These models have the potential to generate scientific content that is indistinguishable from that written by humans. Hence, LLMs carry severe consequences for the scientific community, which relies on the integrity and reliability of publications. This research paper presents a novel ChatGPT-generated scientific text detection method, AI-Catcher. AI-Catcher integrates two deep learning models, multilayer perceptron (MLP) and convolutional neural networks (CNN). The MLP learns the feature representations of the linguistic and statistical features. The CNN extracts high-level representations of the sequential patterns from the textual content. AI-Catcher is a multimodal model that fuses hidden patterns derived from MLP and CNN. In addition, a new ChatGPT-generated scientific text dataset, AIGTxt, is collected to enhance AI-generated text detection tools. AIGTxt contains 3000 records collected from published academic articles across ten domains and divided into three classes: Human-written, ChatGPT-generated, and Mixed text. Several experiments are conducted to evaluate the performance of AI-Catcher. The comparative results demonstrate the capability of AI-Catcher to distinguish between human-written and ChatGPT-generated scientific text more accurately than alternative methods. On average, AI-Catcher improved accuracy by 37.4%.

ai-catcher, detection, detection method, (16 more...)

arXiv:2403.00828

Country:

Asia > Singapore (0.04)
Asia > Middle East > Jordan > Amman Governorate > Amman (0.04)
Asia > Middle East > Iran (0.04)

Genre: Research Report > New Finding (0.48)

Industry:

Education (1.00)
Health & Medicine (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
